Problem Statement: Loan Approval Prediction. Problem Type: Binary Classification. Loan approval prediction is a classic problem for learning and applying many data analysis techniques to build the best classification model.

We are given a dataset consisting of applicants' details and the status of whether each loan application was approved or not. Based on this, a binary classification model is to be created with maximum accuracy.

It seems we need to work on data preparation:

-The Loan Amount column does not follow a normal distribution

-There are outliers in Applicant's Income and Co-applicant's Income

Normal Distribution

Central Limit Theorem: in simple language, most of the data points lie near the mean of all the data points.

To validate the normal distribution of the data: the mean, mode, and median are equal.

We can characterize the distribution of the entire data with the help of the mean and standard deviation.

When the data is normally distributed, most of the data is concentrated near the mean value.

To get an understanding of the distribution, we can simply plot a distribution plot, i.e. a simple histogram.

Normally distributed data produces a bell-shaped curve.

Also, the mean, mode, and median of normally distributed data are equal (Mean = Mode = Median).

One more method is to standardize the data and check that the resulting mean is 0 (or near 0) and the standard deviation is 1 (or near 1).
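As a sketch, the histogram check might look like this (the column name and the values are synthetic stand-ins, not the actual dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for a loan-amount column (illustrative values only)
rng = np.random.default_rng(42)
loan_amount = rng.normal(loc=150, scale=30, size=5_000)

fig, ax = plt.subplots()
counts, _, _ = ax.hist(loan_amount, bins=40)  # bell shape for normal data
ax.set_xlabel("LoanAmount")
ax.set_title("Distribution plot (histogram)")
fig.savefig("loan_amount_hist.png")
```

For truly normal data the bars rise toward the center and fall off symmetrically on both sides.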

Mean = sum(All Data Points)/count(Data Points)

Standard Deviation = Root of { sum [ Square (each data point - mean of whole data) ] / count(Data Points) }
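The two formulas above can be computed directly, here on a small made-up sample (values are illustrative only), cross-checked against Python's standard library:

```python
import statistics

# A small, made-up sample of data points (illustrative only)
data = [100, 120, 110, 130, 140, 125, 115]

mean = sum(data) / len(data)  # Mean = sum(all data points) / count
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5     # population standard deviation

# Cross-check against the standard library
assert mean == statistics.mean(data)
assert abs(std_dev - statistics.pstdev(data)) < 1e-12

print(mean, round(std_dev, 2))  # 120.0 12.25
```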

From the graphs above, we found that these variables are not normally distributed.

We found a right-skewed distribution in these three variables.

Prepare the data for model training, i.e. removing outliers, filling null values, and removing skewness.

->Taking the mode of the values in a column is the best way to fill null values. ->Not the mean, because the values are categorical, not numerical.
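A minimal sketch of the mode-based fill (the column names and values here are assumptions, not necessarily those of the actual loan dataset):

```python
import pandas as pd

# Toy frame with missing categorical values; column names are assumptions
df = pd.DataFrame({"Gender": ["Male", "Female", None, "Male", None],
                   "Self_Employed": ["No", None, "Yes", "No", "No"]})

# Fill each categorical column's nulls with its mode (most frequent value)
for col in ["Gender", "Self_Employed"]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isnull().sum().sum())  # 0 remaining nulls
```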

Now we can see a bell curve for all three variables, and the data is normally distributed.
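One common way to remove right skew is a log transform; the sketch below demonstrates it on synthetic right-skewed income data (the notebook's exact transform is not shown in the text, so the log transform here is an assumption):

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Right-skewed synthetic income data (log-normal, like real incomes)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.8, size=5_000)

before = skewness(income)
after = skewness(np.log1p(income))  # log transform compresses the right tail

print(round(before, 2), round(after, 2))
```

After the transform the skewness drops close to 0 and the histogram looks much more bell-shaped.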

Feature Importance

In order to create the best predictive model, we need to understand the available data well and extract the most information from it.

In multivariate data it is important to understand the importance of the variables and how much they contribute towards the target variable, so that we can remove unnecessary variables to increase model performance.

Many times a dataset contains extra columns that do not provide useful information for classifying the data. This leads the model to make wrong assumptions during training.

To understand the importance of the data, we will use machine learning classifiers and then plot a bar graph based on importance.

Also, XGBoost has a built-in feature-importance plotting tool, which we are going to use.

Using more than one classifier will increase our confidence in the choice of which variables to keep and which to remove.
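As one illustration of the classifier-based importance check, here is a sketch using scikit-learn's RandomForestClassifier on synthetic data (XGBoost's `plot_importance` serves the same purpose); the feature names come from the text, but the values are generated, not real:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan data; names from the text, values generated
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           random_state=42)
names = ["Credit_History", "ApplicantIncome", "CoapplicantIncome", "LoanAmount"]

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Rank features by the fitted model's importance scores
ranking = sorted(zip(names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```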

From feature importance => Credit History, ApplicantIncome, CoapplicantIncome, and LoanAmount are the most important features.

Is the data balanced?

It seems ApplicantIncome and LoanAmount are correlated, CoapplicantIncome is also correlated with LoanAmount, and Credit History is correlated with Loan_Status.

It seems that the data is highly imbalanced.

When the target classes do not have equal counts, the data is considered imbalanced.

From the graph above, the dataset contains more records with approved Loan_Status than rejected Loan_Status: 422 vs. 192.

If the difference were at most 20-30 records, the imbalance would be ignorable.

This imbalance will lead the model to make wrong assumptions and to be biased after training. We will overcome this issue by balancing the data.

To overcome this problem, we will balance the data using resampling techniques: upsampling and downsampling.
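A sketch of the upsampling side of this, using `sklearn.utils.resample` on a toy frame that mirrors the 422/192 imbalance from the text (only the class counts are real; everything else is illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame mirroring the observed imbalance: 422 approved vs 192 rejected
df = pd.DataFrame({"Loan_Status": ["Y"] * 422 + ["N"] * 192})

majority = df[df["Loan_Status"] == "Y"]
minority = df[df["Loan_Status"] == "N"]

# Upsample: draw from the minority class with replacement until counts match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["Loan_Status"].value_counts())  # Y: 422, N: 422
```

Downsampling is the mirror image: resample the majority class without replacement down to `len(minority)`.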

Data Standardization / Normalization

Data normalization is required when variable values lie in very different ranges.

For example, suppose we have two columns, "Age" and "Income",

where "Age" values lie roughly in the range 0-100 and "Income" values lie in the range 20,000 to 100,000.

In this case the model will perform poorly on testing data because the input values are not on the same scale.

So, not every time, but whenever we get such data we need to normalize it, i.e. rescale it.

Widely used scaling tools are the Min-Max Scaler and the Standard Scaler.

Data normalization is done by the Min-Max Scaler, which scales all the values into the 0-1 range.

Data standardization is done by the Standard Scaler, which scales the data so that the mean of the observed data is 0 and the standard deviation is 1.

As our data is not perfectly normally distributed, we will choose standardization using the Standard Scaler, aiming to reduce skewness further and contribute to an accuracy gain.
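Both scalers in action on the hypothetical Age/Income example from above (values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical Age and Income columns on very different scales
X = np.array([[25, 30_000], [40, 60_000], [55, 90_000], [70, 120_000]],
             dtype=float)

mm = MinMaxScaler().fit_transform(X)    # each column rescaled into [0, 1]
ss = StandardScaler().fit_transform(X)  # each column to mean 0, std 1

print(mm.min(axis=0), mm.max(axis=0))
print(ss.mean(axis=0).round(6), ss.std(axis=0).round(6))
```

In practice the scaler is fit on the training split only, then applied to the test split, to avoid leaking test-set statistics.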

Experimental Modeling

In order to gain the maximum possible accuracy, one needs to conduct many experiments.

We will pass the data one by one in different states, i.e.

-Only Scaled data

-Scaled + Down Sampled Data

-Scaled + Up Sampled Data

-Scaled + Up Sampled Data + Selected features with respective importance.
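Each experiment boils down to scoring several classifiers on one state of the data; a minimal sketch of that loop (synthetic data stands in for each preprocessed state, and only three of the classifiers are shown):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for one preprocessed state of the loan dataset
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNearest Neighbors": KNeighborsClassifier(),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```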

Conclusion

Experiment 1 : Scaled data only

Support Vector Machine 83.116

Decision Tree 83.1168

Linear Discriminant Analysis 83.166

KNearest Neighbors 83.766

Gaussian Naive Bayes 83.116

Logistic Regression 83.116

Experiment 2: Scaled + Down Sampled Data

AdaBoost 73.95

Decision Tree 72.91

Voting Ensemble 71.87

Experiment 3: Scaled + Up Sampled Data

Random Forest only 83.88

Experiment 4: Scaled + Selected features with respective importance

Support Vector Machine 83.11

Decision Tree 83.11

AdaBoost 82.46

Linear Discriminant Analysis 83.11

KNearest Neighbors 83.11

Gaussian Naive Bayes 83.11

Logistic Regression 83.11

Also, after parameter tuning:

KNN 83.11

After all the experiments, the maximum accuracy was achieved by balancing the data via upsampling. Surprisingly, only Random Forest performed well in that state of the data.